Efficiently Clustering Very Large Attributed Graphs
Attributed graphs model real networks by enriching their nodes with
attributes accounting for properties. Several techniques have been proposed for
partitioning these graphs into clusters that are homogeneous with respect to
both semantic attributes and the structure of the graph. However, the time and
space complexities of state-of-the-art algorithms limit their scalability to
medium-sized graphs. We propose SToC (for Semantic-Topological Clustering), a
fast and scalable algorithm for partitioning large attributed graphs. The
approach is robust, being compatible both with categorical and with
quantitative attributes, and it is tailorable, allowing the user to weight the
semantic and topological components. Further, the approach does not require the
user to guess in advance the number of clusters. SToC relies on well-known
approximation techniques such as bottom-k sketches, traditional graph-theoretic
concepts, and a new perspective on the composition of heterogeneous distance
measures. Experimental results demonstrate its ability to efficiently compute
high-quality partitions of large-scale attributed graphs.
Comment: This work has been published in ASONAM 2017. This version includes an
appendix with validation of our attribute model and distance function,
omitted in the conference version for lack of space. Please refer to the
published versio
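The abstract above names bottom-k sketches as one of SToC's building blocks but does not show how they are used. The following is a minimal, generic sketch of the technique for estimating the Jaccard similarity of two sets (for example, two node neighbourhoods). It is not the authors' implementation; the function names `stable_hash`, `bottom_k_sketch` and `estimate_jaccard`, the choice of hash, and the example sets are all illustrative assumptions.

```python
import hashlib


def stable_hash(item, seed=0):
    """Map an item to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(items, k):
    """Keep only the k smallest hash values of a set."""
    return sorted({stable_hash(x) for x in items})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate |A & B| / |A | B| from two bottom-k sketches.

    The k smallest hashes of the union can be recovered from the two
    sketches alone; the estimate is the fraction of those hashes that
    appear in both sketches.
    """
    a, b = set(sketch_a), set(sketch_b)
    union_k = sorted(a | b)[:k]
    if not union_k:
        return 0.0
    shared = sum(1 for value in union_k if value in a and value in b)
    return shared / len(union_k)


# Hypothetical neighbourhoods of two nodes.
neigh_u = set(range(1, 11))    # {1, ..., 10}
neigh_v = set(range(6, 16))    # {6, ..., 15}
k = 64
similarity = estimate_jaccard(
    bottom_k_sketch(neigh_u, k), bottom_k_sketch(neigh_v, k), k
)
```

When both sets hold fewer than k elements, as here, the sketches contain every hash and the estimate coincides with the exact Jaccard similarity. For larger sets the sketch size k trades memory for accuracy.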
Outlier Edge Detection Using Random Graph Generation Models and Applications
Outliers are samples that are generated by different mechanisms from other
normal data samples. Graphs, in particular social network graphs, may contain
nodes and edges that are made by scammers, malicious programs or mistakenly by
normal users. Detecting outlier nodes and edges is important for data mining
and graph analytics. However, previous research in the field has focused only
on detecting outlier nodes. In this article, we study the properties of edges
and propose outlier edge detection algorithms using two random graph generation
models. We found that the edge-ego-network, which can be defined as the induced
graph that contains two end nodes of an edge, their neighboring nodes and the
edges that link these nodes, contains critical information to detect outlier
edges. We evaluated the proposed algorithms by injecting outlier edges into
some real-world graph data. Experimental results show that the proposed
algorithms can effectively detect outlier edges. In particular, the algorithm
based on the Preferential Attachment Random Graph Generation model consistently
gives good performance regardless of the test graph data. Furthermore, the
proposed algorithms are not limited to outlier edge detection. We
demonstrate three different applications that benefit from the proposed
algorithms: 1) a preprocessing tool that improves the performance of graph
clustering algorithms; 2) an outlier node detection algorithm; and 3) a novel
noisy data clustering algorithm. These applications show the great potential of
the proposed outlier edge detection techniques.
Comment: 14 pages, 5 figures, journal pape
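The edge-ego-network is defined in the abstract above as the induced graph on the two end nodes of an edge and their neighbouring nodes. A minimal sketch of that definition follows, assuming an undirected graph stored as a dictionary of neighbour sets. The function name `edge_ego_network` and the toy graph are hypothetical, and the scoring step built on the random graph generation models is not shown because the abstract does not describe it.

```python
def edge_ego_network(adj, u, v):
    """Return (nodes, edges) of the edge-ego-network of edge (u, v).

    adj maps each node to the set of its neighbours (undirected graph).
    Nodes: u, v and all of their neighbours.
    Edges: every edge of the graph whose two endpoints are in that node set.
    """
    nodes = {u, v} | adj[u] | adj[v]
    edges = {
        frozenset((a, b))
        for a in nodes
        for b in adj[a]
        if b in nodes
    }
    return nodes, edges


# Toy path graph 1 - 2 - 3 - 4 - 5 - 6 (illustrative data only).
graph = {
    1: {2},
    2: {1, 3},
    3: {2, 4},
    4: {3, 5},
    5: {4, 6},
    6: {5},
}
ego_nodes, ego_edges = edge_ego_network(graph, 2, 3)
```

Edges are stored as `frozenset` pairs so that (a, b) and (b, a) are counted once.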
On defining rules for cancer data fabrication
Funding: This research is partially funded by the Data Lab, and the EU H2020 project Serums: Securing Medical Data in Smart Patient-Centric Healthcare Systems (grant 826278).
Data is essential for machine learning projects, and data accuracy is crucial for being able to trust the results obtained from the associated machine learning models. Previously, we have developed machine learning models for predicting the treatment outcome for breast cancer patients that have undergone chemotherapy, and developed a monitoring system for their treatment timeline showing interactively the options and associated predictions. Available cancer datasets, such as the one used earlier, are often too small to obtain significant results, and make it difficult to explore ways to improve the predictive capability of the models further. In this paper, we explore an alternative to enhance our datasets through synthetic data generation. From our original dataset, we extract rules to generate fabricated data that capture the different characteristics inherent in the dataset. Additional rules can be used to capture general medical knowledge. We show how to formulate rules for our cancer treatment data, and use the IBM solver to obtain a corresponding synthetic dataset. We discuss challenges for future work.
Postprin
Seismic risk in the city of Al Hoceima (north of Morocco) using the vulnerability index method, applied in Risk-UE project
The final publication is available at Springer via http://dx.doi.org/10.1007/s11069-016-2566-8
Al Hoceima is one of the most seismically active regions in the north of Morocco. This is demonstrated by the large seismic episodes reported in seismic catalogs and research studies. In addition, seismic risk is relatively high due to vulnerable buildings that are either old or do not respect seismic standards. Our aim is to present a study of seismic risk and seismic scenarios for the city of Al Hoceima. The seismic vulnerability of the existing residential buildings was evaluated using the vulnerability index method (Risk-UE). It was chosen to be adapted and applied to Moroccan constructions for its practicality and simple methodology. A visual inspection of 1102 buildings was carried out to assess the vulnerability factors. Seismic hazard was evaluated in terms of macroseismic intensity for two scenarios (one deterministic and one probabilistic). The maps of seismic risk represent direct damage to buildings, damage to the population and economic cost. According to the results, the mean vulnerability index of the city is equal to 0.49 and the seismic risk is estimated as Slight (mean damage grade equal to 0.9 for the deterministic scenario and 0.7 for the probabilistic scenario). However, moderate to heavy damage is expected in areas located in the newer extensions, in both the east and west of the city. Important economic losses and damage to the population are expected in these areas as well. The resulting maps can serve as a guide for decision making in the field of seismic risk prevention and mitigation strategies in Al Hoceima.
Peer Reviewed
Postprint (author's final draft
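The abstract above reports a vulnerability index and a damage grade per scenario but does not state how the two are related. As a hedged illustration, the sketch below uses the damage function commonly quoted for the Risk-UE macroseismic (vulnerability index) approach, which maps a macroseismic intensity and a vulnerability index to a mean damage grade between 0 and 5. The constants 6.25, 13.1 and the ductility value 2.3 are assumptions taken from that standard formulation and may differ from the adaptation used in the study. The function name and the example inputs are hypothetical.

```python
import math


def mean_damage_grade(intensity, vulnerability_index, ductility=2.3):
    """Mean damage grade (0 to 5) from macroseismic intensity and
    vulnerability index, using the commonly quoted Risk-UE form:

        mu_D = 2.5 * (1 + tanh((I + 6.25 * V - 13.1) / Q))

    The constants are assumptions, not values taken from the abstract.
    """
    argument = (intensity + 6.25 * vulnerability_index - 13.1) / ductility
    return 2.5 * (1.0 + math.tanh(argument))


# Hypothetical inputs: the same building stock under two intensities.
grade_low = mean_damage_grade(intensity=7.0, vulnerability_index=0.49)
grade_high = mean_damage_grade(intensity=9.0, vulnerability_index=0.49)
```

Under this form the damage grade increases with both intensity and vulnerability index and saturates at 5, which matches the qualitative pattern in the abstract: higher damage where buildings are more vulnerable or the scenario intensity is larger.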
Discovering Polarized Communities in Signed Networks
Signed networks contain edge annotations to indicate whether each interaction
is friendly (positive edge) or antagonistic (negative edge). The model is
simple but powerful and it can capture novel and interesting structural
properties of real-world phenomena. The analysis of signed networks has many
applications from modeling discussions in social media, to mining user reviews,
and to recommending products in e-commerce sites. In this paper we consider the
problem of discovering polarized communities in signed networks. In particular,
we search for two communities (subsets of the network vertices) where within
communities there are mostly positive edges while across communities there are
mostly negative edges. We formulate this novel problem as a "discrete
eigenvector" problem, which we show to be NP-hard. We then develop two
intuitive spectral algorithms: one deterministic, and one randomized with a
quality guarantee stated in terms of the number of vertices in the
graph, tight up to constant factors. We validate our algorithms against
non-trivial baselines on real-world signed networks. Our experiments confirm
that our algorithms produce higher quality solutions, are much faster and can
scale to much larger networks than the baselines, and are able to detect
ground-truth polarized communities.
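The abstract above describes the target structure (mostly positive edges within each community, mostly negative edges across them) and calls the formulation a "discrete eigenvector" problem without giving the objective. One plausible reading, assumed here and not confirmed by the abstract, is a Rayleigh quotient x^T A x / x^T x over assignment vectors with entries in {-1, 0, +1}, where A is the signed adjacency matrix, +1 and -1 mark the two communities, and 0 marks vertices left out. The sketch below only evaluates that score for a given assignment; it does not implement either of the spectral algorithms. The function name `polarity_score` and the toy network are hypothetical.

```python
def polarity_score(signed_edges, assignment):
    """Score an assignment of nodes to {-1, 0, +1}.

    signed_edges maps an undirected edge (u, v) to its sign, +1 or -1.
    Edges that agree with the assignment (positive inside a community,
    negative across the two) add to the score; the rest subtract.
    The total is normalised by the number of assigned nodes.
    """
    size = sum(1 for value in assignment.values() if value != 0)
    if size == 0:
        return 0.0
    total = 0
    for (u, v), sign in signed_edges.items():
        # Factor 2: each undirected edge appears twice in x^T A x.
        total += 2 * sign * assignment.get(u, 0) * assignment.get(v, 0)
    return total / size


# Toy signed network: two friendly pairs that oppose each other.
edges = {
    ("a", "b"): +1,
    ("c", "d"): +1,
    ("a", "c"): -1,
    ("b", "d"): -1,
}
polarized = polarity_score(edges, {"a": 1, "b": 1, "c": -1, "d": -1})
unpolarized = polarity_score(edges, {"a": 1, "b": 1, "c": 1, "d": 1})
```

Allowing the value 0 lets an assignment ignore vertices that belong to neither community, which is why the score is normalised by the number of assigned nodes and not by the size of the whole graph.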
A pathway to identifying and valuing cultural ecosystem services: an application to marine food webs
Beyond recreation, little attention has been paid thus far to the economic valuation of Cultural Ecosystem Services (CESs), especially in the context of coastal or marine environments. This paper develops and tests a pathway to the identification and economic valuation of CESs. The pathway enables researchers to make more explicit, and to economically value, cultural dimensions of environmental change. We suggest that the valuation process includes a simultaneous development of the scenarios of environmental change including related biophysical impacts, and a documentation of culture-environment linkages. A well-defined ecosystem service typology is also needed to classify cultural-ecological linkages as specific CESs. The pathway then involves the development of detailed, multidimensional depictions of the culture-environment linkages for use in a stated preference survey. The anticipated CES interpretations should be confirmed through debriefing questions in the survey questionnaire. The proposed approach is demonstrated with a choice experiment-based case study in Turkey that focuses on improvements to the food web of the Black Sea. The results of this study indicate that economic preferences for CESs other than recreation can be estimated in a way that is economically consistent using the proposed approach.